Decision Tree Models¶



Decision tree models, including classifier tree models, are a type of supervised machine learning algorithm that is widely used for both classification and regression tasks. Here's a brief introduction to classifier tree models, covering their purpose, strengths, weaknesses, and the types of data they work best on:

Purpose:¶

Classifier tree models are used for making decisions based on input features. They recursively split the data into subsets based on the feature values, creating a tree structure where each leaf node corresponds to a class label. During the training process, the algorithm learns a set of decision rules that best separates the classes.

Strengths:¶

  1. Interpretability: Decision trees are easily interpretable and can be visualized. The rules learned by the model are intuitive and can be understood by non-experts.
  2. Non-linearity: Decision trees can model complex, non-linear relationships in the data.
  3. Feature Importance: They provide a measure of feature importance, indicating which features contribute the most to the decision-making process.
  4. Handle Mixed Data Types: Decision trees can handle a mix of categorical and numerical features without requiring additional preprocessing.

Weaknesses:¶

  1. Overfitting: Decision trees are prone to overfitting, capturing noise in the training data. This can be addressed by using techniques like pruning or using ensemble methods.
  2. Instability: Small changes in the data can lead to different tree structures, making them sensitive to variations in the training set.
  3. Greedy Nature: The algorithm makes locally optimal decisions at each node, which may not lead to the globally optimal tree.

Data:¶

Classifier tree models can work well on various types of data, including:

  • Categorical Data: Decision trees can handle categorical features naturally, making them suitable for problems with categorical variables.
  • Mixed Data Types: They can work with a mix of numerical and categorical features.
  • Binary and Multiclass Classification: Decision trees are versatile and can handle binary and multiclass classification problems.
  • Small to Medium-sized Datasets: They are suitable for datasets of moderate size, but their performance may degrade on very large datasets.

Use Cases:¶

  • Fraud Detection: Decision trees can be used to identify patterns associated with fraudulent transactions.
  • Medical Diagnosis: In healthcare, decision trees can assist in diagnosing diseases based on patient information.
  • Customer Segmentation: Decision trees can help segment customers based on their behavior or characteristics.

In practice, decision trees are often used as building blocks for more sophisticated models, such as random forests or gradient boosting, which address some of the weaknesses associated with individual decision trees.


What is Snap ML?¶

Snap ML is a machine learning library developed by IBM for use with Apache Spark. It is designed to accelerate and scale machine learning workflows on Spark, allowing data scientists and developers to train and deploy machine learning models efficiently.

Here are some key features and aspects of Snap ML:

  • Performance Optimization: Snap ML is designed to enhance the performance of machine learning tasks on large-scale distributed computing frameworks, such as Apache Spark. It leverages hardware acceleration, including GPUs, to speed up training and inference processes.

  • Compatibility with Spark: Snap ML integrates seamlessly with Apache Spark, making it easier to incorporate machine learning into Spark-based big data processing pipelines. This compatibility allows users to take advantage of Spark's distributed computing capabilities.

  • Support for Various Machine Learning Algorithms: Snap ML provides a set of machine learning algorithms that cover tasks such as classification, regression, and clustering. These algorithms are optimized for performance and scalability.

  • Distributed Training: The library supports distributed training of machine learning models, enabling the processing of large datasets across multiple nodes in a Spark cluster.

  • Scalability: Snap ML is designed to scale horizontally, making it suitable for handling large datasets and training complex models in a distributed computing environment.

In [ ]:
# Snap ML is available on PyPI. To install it simply run the pip command below.
# !pip install snapml
In [ ]:
# %%bash
# missing library was throwing an error during import
# brew install libomp 

1. Download data from Kaggle¶


In [ ]:
# install the opendatasets package
!pip install opendatasets
import opendatasets as od

# download the dataset (this is a Kaggle dataset)
# during download you will be required to input your Kaggle username and password
od.download("https://www.kaggle.com/mlg-ulb/creditcardfraud")
In [ ]:
# Use URL if Kaggle fails...
# url= "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/creditcard.csv"
# raw_data=pd.read_csv(url)

2. Import Packages¶


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import compute_sample_weight
from sklearn.metrics import roc_auc_score
import time
import snapml
%matplotlib inline

3. Read and Summarize Data¶


In [7]:
# read the input data
raw_data = pd.read_csv('creditcardfraud/creditcard.csv')
print("There are " + str(len(raw_data)) + " observations in the credit card fraud dataset.")
print("There are " + str(len(raw_data.columns)) + " variables in the dataset.")
print("There are " + str(len(raw_data.Time.unique())) + " unique elements in the Time column.")
# display the first rows in the dataset
raw_data.head()
There are 284807 observations in the credit card fraud dataset.
There are 31 variables in the dataset.
There are 124592 unique elements in the Time column.
Out[7]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

5 rows × 31 columns

Replicates for Classification Tree Analysis (ChatGPT)¶

In the context of Classification Tree Analysis, when you mention "replicas," it's possible that you are referring to the concept of replication in the context of statistical or machine learning experiments. Replication involves creating multiple copies or instances of a dataset to assess the stability and reliability of a model's performance. Below are some reasons why replicas might be used in the context of Classification Tree Analysis:

  1. Bootstrap Aggregating (Bagging): Replicas are often used in techniques like Bootstrap Aggregating or Bagging. In Bagging, multiple bootstrap samples (replicas) are created by randomly drawing samples with replacement from the original dataset. A classification tree is then trained on each bootstrap sample, and the final prediction is obtained by averaging (for regression) or voting (for classification) across all trees. This helps to reduce overfitting and improve the stability of the model.

  2. Random Forests: Random Forests, an ensemble method based on Bagging, extends the concept by introducing additional randomness. In addition to creating replicas through bootstrap sampling, Random Forests also randomly select a subset of features for each split in the tree-building process. This further enhances the diversity of the individual trees and improves the overall model's generalization.

  3. Cross-Validation: Replicas can be used in the context of cross-validation. In k-fold cross-validation, the dataset is divided into k subsets, and the model is trained and evaluated k times, each time using a different subset as the test set. This process is repeated to create multiple replicas of the cross-validation procedure, providing a more robust estimate of the model's performance.

  4. Assessing Variability: Replicas help in assessing the variability or uncertainty associated with model predictions. By training and evaluating the model on multiple replicas, you can get a sense of how stable the model is under different subsamples of the data. This information is valuable for understanding the reliability of the model's predictions.

  5. Hyperparameter Tuning: Replicas may also be used in hyperparameter tuning processes, where different configurations of hyperparameters are evaluated on multiple replicas of the training data. This helps in finding hyperparameter settings that generalize well across different subsets of the data.

In summary, the use of replicas in Classification Tree Analysis, especially in ensemble methods like Bagging and Random Forests, aims to improve model stability, reduce overfitting, and provide a more reliable estimate of the model's performance on unseen data.

4. Inflate and Re-summarize Data¶


In [8]:
n_replicas = 10

# inflate the original dataset for the sake of having a larger dataset
big_raw_data = pd.DataFrame(np.repeat(raw_data.values, n_replicas, axis=0), columns=raw_data.columns)

print("There are " + str(len(big_raw_data)) + " observations in the inflated credit card fraud dataset.")
print("There are " + str(len(big_raw_data.columns)) + " variables in the dataset.")

# display first rows in the new dataset
big_raw_data.head()
There are 2848070 observations in the inflated credit card fraud dataset.
There are 31 variables in the dataset.
Out[8]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0.0
1 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0.0
2 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0.0
3 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0.0
4 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0.0

5 rows × 31 columns

Pie chart to visualize value counts¶

In [9]:
# get the set of distinct classes
labels = big_raw_data.Class.unique()

# get the count of each class
sizes = big_raw_data.Class.value_counts().values

# plot the class value counts
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct='%1.3f%%')
ax.set_title('Target Variable Value Counts')
plt.show()

As shown above, the Class variable has two values: 0 (the credit card transaction is legitimate) and 1 (the credit card transaction is fraudulent). Thus, you need to model a binary classification problem. Moreover, the dataset is highly unbalanced, the target variable classes are not represented equally. This case requires special attention when training or when evaluating the quality of a model. One way of handing this case at train time is to bias the model to pay more attention to the samples in the minority class. The models under the current study will be configured to take into account the class weights of the samples at train/fit time.

Histigram of data set values¶

In [10]:
print("Minimum amount value is ", np.min(big_raw_data.Amount.values))
print("Maximum amount value is ", np.max(big_raw_data.Amount.values))
print("90% of the transactions have an amount less or equal than ", np.percentile(raw_data.Amount.values, 90))

flat_data = big_raw_data.Amount.values.flatten()

# Create a histogram
fig = go.Figure(data=[go.Histogram(x=flat_data, nbinsx=50)])

# Update layout
fig.update_layout(
    title='Histogram of All Values Across All Columns',
    xaxis_title='Values',
    yaxis_title='Frequency'
)

# Show the plot
fig.show()
Minimum amount value is  0.0
Maximum amount value is  25691.16
90% of the transactions have an amount less or equal than  203.0

5. Data Preprocessing¶


In [11]:
# data preprocessing such as scaling/normalization is typically useful for 
# linear models to accelerate the training convergence

# standardize features by removing the mean and scaling to unit variance
big_raw_data.iloc[:, 1:30] = StandardScaler().fit_transform(big_raw_data.iloc[:, 1:30])
data_matrix = big_raw_data.values

# X: feature matrix (for this analysis, we exclude the Time variable from the dataset)
X = data_matrix[:, 1:30]

# y: labels vector
y = data_matrix[:, 30]

# data normalization
X = normalize(X, norm="l1")

# print the shape of the features matrix and the labels vector
print('X.shape=', X.shape, 'y.shape=', y.shape)
X.shape= (2848070, 29) y.shape= (2848070,)

6. Dataset Train test split¶


In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)       
print('X_train.shape=', X_train.shape, 'Y_train.shape=', y_train.shape)
print('X_test.shape=', X_test.shape, 'Y_test.shape=', y_test.shape)
X_train.shape= (1993649, 29) Y_train.shape= (1993649,)
X_test.shape= (854421, 29) Y_test.shape= (854421,)

7. Build a Decision Tree Classifier Model:¶


with Scikit-Learn¶

In [40]:
# compute the sample weights to be used as input to the train routine so that 
# it takes into account the class imbalance present in this dataset
w_train = compute_sample_weight('balanced', y_train)

# import the Decision Tree Classifier Model from scikit-learn
# for reproducible output across multiple function calls, set random_state to a given integer value
sklearn_dt = DecisionTreeClassifier(max_depth=4, random_state=35)

# train a Decision Tree Classifier using scikit-learn
t0 = time.time()
sklearn_dt.fit(X_train, y_train, sample_weight=w_train)
sklearn_time = time.time()-t0
print("[Scikit-Learn] Training time (s):  {0:.5f}".format(sklearn_time))
[Scikit-Learn] Training time (s):  17.29543

with Snap ML¶

In [41]:
# if not already computed, 
# compute the sample weights to be used as input to the train routine so that 
# it takes into account the class imbalance present in this dataset
# w_train = compute_sample_weight('balanced', y_train)

# import the Decision Tree Classifier Model from Snap ML
from snapml import DecisionTreeClassifier

# Snap ML offers multi-threaded CPU/GPU training of decision trees, unlike scikit-learn
# to use the GPU, set the use_gpu parameter to True
# snapml_dt = DecisionTreeClassifier(max_depth=4, random_state=45, use_gpu=True)

# to set the number of CPU threads used at training time, set the n_jobs parameter
# for reproducible output across multiple function calls, set random_state to a given integer value
snapml_dt = DecisionTreeClassifier(max_depth=4, random_state=45, n_jobs=4)

# train a Decision Tree Classifier model using Snap ML
t0 = time.time()
snapml_dt.fit(X_train, y_train, sample_weight=w_train)
snapml_time = time.time()-t0
print("[Snap ML] Training time (s):  {0:.5f}".format(snapml_time))
[Snap ML] Training time (s):  1.75161

8. Evaluate the Scikit-Learn and Snap ML Decision Tree Classifier Models¶


In [42]:
# Snap ML vs Scikit-Learn training speedup
training_speedup = sklearn_time/snapml_time
print('[Decision Tree Classifier] Snap ML vs. Scikit-Learn speedup : {0:.2f}x '.format(training_speedup))

# run inference and compute the probabilities of the test samples 
# to belong to the class of fraudulent transactions
sklearn_pred = sklearn_dt.predict_proba(X_test)[:,1]

# evaluate the Compute Area Under the Receiver Operating Characteristic 
# Curve (ROC-AUC) score from the predictions
sklearn_roc_auc = roc_auc_score(y_test, sklearn_pred)
print('[Scikit-Learn] ROC-AUC score : {0:.3f}'.format(sklearn_roc_auc))

# run inference and compute the probabilities of the test samples
# to belong to the class of fraudulent transactions
snapml_pred = snapml_dt.predict_proba(X_test)[:,1]

# evaluate the Compute Area Under the Receiver Operating Characteristic
# Curve (ROC-AUC) score from the prediction scores
snapml_roc_auc = roc_auc_score(y_test, snapml_pred)   
print('[Snap ML] ROC-AUC score : {0:.3f}'.format(snapml_roc_auc))
[Decision Tree Classifier] Snap ML vs. Scikit-Learn speedup : 9.87x 
[Scikit-Learn] ROC-AUC score : 0.966
[Snap ML] ROC-AUC score : 0.966

9. Build a Support Vector Machine Model:¶


with Scikit-Learn¶

In [49]:
# import the linear Support Vector Machine (SVM) model from Scikit-Learn
from sklearn.svm import LinearSVC

# instatiate a scikit-learn SVM model
# to indicate the class imbalance at fit time, set class_weight='balanced'
# for reproducible output across multiple function calls, set random_state to a given integer value
sklearn_svm = LinearSVC(class_weight='balanced', random_state=21, loss="hinge", dual=True, fit_intercept=False)

# train a linear Support Vector Machine model using Scikit-Learn
t0 = time.time()
sklearn_svm.fit(X_train, y_train)
sklearn_time = time.time() - t0
print("[Scikit-Learn] Training time (s):  {0:.2f}".format(sklearn_time))
[Scikit-Learn] Training time (s):  50.47
/Users/davidfoutch/anaconda3/lib/python3.10/site-packages/sklearn/svm/_base.py:1242: ConvergenceWarning:

Liblinear failed to converge, increase the number of iterations.

with Snap ML¶

In [26]:
# import the Support Vector Machine model (SVM) from Snap ML
from snapml import SupportVectorMachine

# in contrast to scikit-learn's LinearSVC, Snap ML offers multi-threaded CPU/GPU training of SVMs
# to use the GPU, set the use_gpu parameter to True
# snapml_svm = SupportVectorMachine(class_weight='balanced', random_state=25, use_gpu=True, fit_intercept=False)

# to set the number of threads used at training time, one needs to set the n_jobs parameter
snapml_svm = SupportVectorMachine(class_weight='balanced', random_state=25, n_jobs=4, fit_intercept=False)
# print(snapml_svm.get_params())

# train an SVM model using Snap ML
t0 = time.time()
model = snapml_svm.fit(X_train, y_train)
snapml_time = time.time() - t0
print("[Snap ML] Training time (s):  {0:.2f}".format(snapml_time))
[Snap ML] Training time (s):  3.57

11. Evaluate the Scikit-Learn and Snap ML Support Vector Machine Models¶


In [25]:
# compute the Snap ML vs Scikit-Learn training speedup
training_speedup = sklearn_time/snapml_time
print('[Support Vector Machine] Snap ML vs. Scikit-Learn training speedup : {0:.2f}x '.format(training_speedup))

# run inference using the Scikit-Learn model
# get the confidence scores for the test samples
sklearn_pred = sklearn_svm.decision_function(X_test)

# evaluate accuracy on test set
acc_sklearn  = roc_auc_score(y_test, sklearn_pred)
print("[Scikit-Learn] ROC-AUC score:   {0:.3f}".format(acc_sklearn))

# run inference using the Snap ML model
# get the confidence scores for the test samples
snapml_pred = snapml_svm.decision_function(X_test)

# evaluate accuracy on test set
acc_snapml  = roc_auc_score(y_test, snapml_pred)
print("[Snap ML] ROC-AUC score:   {0:.3f}".format(acc_snapml))
[Support Vector Machine] Snap ML vs. Scikit-Learn training speedup : 13.55x 
[Scikit-Learn] ROC-AUC score:   0.984
[Snap ML] ROC-AUC score:   0.985

12. Evaluate the SVM Models Using the hinge loss metric¶


In [24]:
# get the confidence scores for the test samples
sklearn_pred = sklearn_svm.decision_function(X_test)
snapml_pred  = snapml_svm.decision_function(X_test)

# import the hinge_loss metric from scikit-learn
from sklearn.metrics import hinge_loss

# evaluate the hinge loss from the predictions
loss_snapml = hinge_loss(y_test, snapml_pred)
print("[Snap ML] Hinge loss:   {0:.3f}".format(loss_snapml))

# evaluate the hinge loss metric from the predictions
loss_sklearn = hinge_loss(y_test, sklearn_pred)
print("[Scikit-Learn] Hinge loss:   {0:.3f}".format(loss_snapml))

# the two models should give the same Hinge loss
[Snap ML] Hinge loss:   0.228
[Scikit-Learn] Hinge loss:   0.228
In [ ]: